RankMass Crawler: A Crawler with High PageRank Coverage Guarantee
Authors
Abstract
Crawling algorithms have been the subject of extensive research and optimizations, but some important questions remain open. In particular, given the infinite number of pages available on the Web, search-engine operators constantly struggle with the following vexing questions: When can I stop downloading the Web? How many pages should I download to cover “most” of the Web? How can I know I am not missing an important part when I stop? In this paper we provide an answer to these questions by developing a family of crawling algorithms that (1) provide a theoretical guarantee on how much of the “important” part of the Web they will download after crawling a certain number of pages and (2) give a high priority to important pages during a crawl, so that the search engine can index the most important part of the Web first. We prove the correctness of our algorithms by theoretical analysis and evaluate their performance experimentally based on 141 million URLs obtained from the Web. Our experiments demonstrate that even our simple algorithm is effective in downloading important pages early on and provides high “coverage” of the Web with a relatively small number of pages.
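The coverage guarantee above rests on PageRank mass: how much of the total page importance a crawler has captured after downloading a given set of pages. The following is a hedged sketch of that idea; the toy link graph, damping factor, and greedy descending-rank ordering are illustrative assumptions, not the paper's exact RankMass algorithm.

```python
# Hedged sketch: PageRank by power iteration on a toy link graph, then a
# greedy "download" in descending-rank order while tracking cumulative
# rank mass. Graph, damping factor, and ordering are illustrative
# assumptions, not the paper's exact RankMass algorithm.

def pagerank(links, d=0.85, iters=60):
    """Power iteration; dangling pages spread their rank uniformly."""
    n = len(links)
    pr = {p: 1.0 / n for p in links}
    for _ in range(iters):
        nxt = {p: (1.0 - d) / n for p in links}
        for p, outs in links.items():
            if outs:
                share = d * pr[p] / len(outs)
                for q in outs:
                    nxt[q] += share
            else:  # dangling page: no out-links
                for q in links:
                    nxt[q] += d * pr[p] / n
        pr = nxt
    return pr

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
pr = pagerank(links)

# Download pages in descending PageRank; the cumulative mass shows how
# much of the total "importance" is covered after each download.
mass, coverage = 0.0, []
for page in sorted(pr, key=pr.get, reverse=True):
    mass += pr[page]
    coverage.append((page, mass))
```

Because PageRank sums to 1 over all pages, the cumulative mass after each step directly answers "how much of the important part have I covered so far?".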
Similar Papers
RankMass Crawler: A Crawler with High Personalized PageRank Coverage Guarantee
Crawling algorithms have been the subject of extensive research and optimizations, but some important questions remain open. In particular, given the unbounded number of pages available on the Web, search-engine operators constantly struggle with the following vexing questions: When can I stop downloading the Web? How many pages should I download to cover “most” of the Web? How can I know I am ...
The Implementation of Hadoop-based Crawler System and Graphlite-based PageRank-Calculation In Search Engine
Nowadays, the size of the Internet is experiencing rapid growth. As of December 2014, the number of global Internet websites had exceeded 1 billion, and all kinds of information resources are integrated together on the Internet. The search engine has therefore become a necessary tool for all users to retrieve useful information from vast amounts of web data. Generally speaking, a complete search e...
SubgraphRank: PageRank Approximation for a Subgraph or in a Decentralized System
PageRank, a ranking metric for hypertext web pages, has received increasing interest. As the Web has grown in size, computing PageRank scores over the whole web with centralized approaches faces scalability challenges. Distributed systems such as peer-to-peer (P2P) networks are employed to speed up PageRank. In a P2P system, each peer crawls web fragments independently. Hence the web fragment on ...
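The per-peer computation described above can be illustrated with a toy local PageRank: each peer ranks only its own fragment, and rank flowing to pages outside the fragment is redistributed uniformly inside it. This is a hedged sketch under that assumption, not the SubgraphRank method itself; the fragment graph is made up.

```python
# Hedged sketch of per-peer local PageRank on one web fragment. Links
# that leave the fragment "leak" rank, which is redistributed uniformly
# over the fragment. Illustrative only; not the SubgraphRank algorithm.

def local_pagerank(fragment, d=0.85, iters=60):
    n = len(fragment)
    pr = {p: 1.0 / n for p in fragment}
    for _ in range(iters):
        nxt = {p: (1.0 - d) / n for p in fragment}
        leaked = 0.0
        for p, outs in fragment.items():
            if outs:
                share = d * pr[p] / len(outs)
                for q in outs:
                    if q in fragment:
                        nxt[q] += share
                    else:          # link points outside this peer's crawl
                        leaked += share
            else:                  # dangling page
                leaked += d * pr[p]
        for p in fragment:         # give leaked mass back uniformly
            nxt[p] += leaked / n
        pr = nxt
    return pr

fragment = {"x": ["y", "http://other-peer/page"], "y": ["x"], "z": ["x"]}
pr = local_pagerank(fragment)
```

Redistributing the leaked mass keeps the local scores summing to 1, so each peer's ranking stays a proper probability distribution over its own fragment.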
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download domain-specific web pages, and an unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...
Search optimization technique for Domain Specific Parallel Crawler
The architectural framework of the World Wide Web is used for accessing linked documents spread over millions of machines across the Internet. The Web is a system that makes the exchange of data on the Internet easy and efficient. Due to the exponential growth of the web, it has become a challenge to traverse all URLs in web documents and handle these documents, so it is necessary to optimize the paralle...